Airbnb Listing Analysis by Gisle Tveit Gaasemyr

The dataset I have chosen consists of Airbnb listing data from San Francisco, collected on 28 different dates between November 2013 and July 2017. I originally downloaded the datasets from http://tomslee.net/airbnb-data-collection-get-the-data as individual files for each date, which I then combined into a single file. For more details on this, see union_script.R in the p6 folder.

For details on the content of the columns, see http://tomslee.net/airbnb-data-collection-get-the-data.

Univariate Plots Section

I will start my analysis by taking a look at some records, to get an initial feel of the dataset.

##   room_id host_id       room_type borough          neighborhood reviews
## 1    8014   22402    Private room      NA         Outer Mission      15
## 2   10832   38836 Entire home/apt      NA Downtown/Civic Center       2
## 3   26488  112300 Entire home/apt      NA    Financial District      48
## 4   45900  204441 Entire home/apt      NA    Financial District     158
## 5   54518  255971 Entire home/apt      NA    Financial District       5
## 6   56489   40784 Entire home/apt      NA       South of Market       2
##   overall_satisfaction accommodates bedrooms price minstay latitude
## 1                  4.5            1        4    49       2 37.73075
## 2                  5.0            4        1   172      30 37.78590
## 3                  5.0            2        1  1097     360 37.79090
## 4                  4.5            6        3   219       1 37.78858
## 5                  4.5            2        1   187      30 37.79330
## 6                  5.0            4        2  1099      30 37.78915
##   longitude       last_modified date_collected
## 1 -122.4484 2013-12-08 14:09:52     2013-11-17
## 2 -122.4083 2013-12-08 18:54:01     2013-11-17
## 3 -122.3933 2013-12-07 23:36:57     2013-11-17
## 4 -122.4048 2013-12-07 03:09:04     2013-11-17
## 5 -122.4008 2013-12-07 03:32:39     2013-11-17
## 6 -122.3895 2013-12-07 01:48:46     2013-11-17
##         room_id   host_id    room_type borough    neighborhood reviews
## 205527 13886867  19206511 Private room      NA      Ocean View       8
## 205528 15481646    160490 Private room      NA  Haight Ashbury       0
## 205529 14268636  24091826 Private room      NA       Excelsior      49
## 205530  9033827  37570945 Private room      NA       Excelsior       3
## 205531 15605282   5020080 Private room      NA  Inner Richmond      12
## 205532 18089644 124497978 Private room      NA South of Market       0
##        overall_satisfaction accommodates bedrooms price minstay latitude
## 205527                  4.5            2        1    40      NA 37.71325
## 205528                  0.0            1        1    10      NA 37.76497
## 205529                  5.0            2        1    39      NA 37.72478
## 205530                  4.5            1        1    40      NA 37.72581
## 205531                  5.0            2        1    38      NA 37.78226
## 205532                  0.0            1        1    10      NA 37.77123
##        longitude last_modified date_collected
## 205527 -122.4582       46:31.1     2017-07-10
## 205528 -122.4520       46:31.1     2017-07-10
## 205529 -122.4319       46:31.1     2017-07-10
## 205530 -122.4038       46:31.1     2017-07-10
## 205531 -122.4778       46:31.1     2017-07-10
## 205532 -122.4042       46:31.1     2017-07-10

Next I’ll get some information about the data types.

## 'data.frame':    205532 obs. of  15 variables:
##  $ room_id             : int  8014 10832 26488 45900 54518 56489 64332 70284 70753 71370 ...
##  $ host_id             : int  22402 38836 112300 204441 255971 40784 40784 329072 329072 364983 ...
##  $ room_type           : Factor w/ 4 levels "","Entire home/apt",..: 3 2 2 2 2 2 2 4 4 2 ...
##  $ borough             : logi  NA NA NA NA NA NA ...
##  $ neighborhood        : Factor w/ 37 levels "Bayview","Bernal Heights",..: 22 7 9 9 9 32 32 4 9 22 ...
##  $ reviews             : int  15 2 48 158 5 2 14 8 70 10 ...
##  $ overall_satisfaction: num  4.5 5 5 4.5 4.5 5 4 4.5 4.5 4.5 ...
##  $ accommodates        : int  1 4 2 6 2 4 6 1 4 6 ...
##  $ bedrooms            : int  4 1 1 3 1 2 2 4 1 2 ...
##  $ price               : int  49 172 1097 219 187 1099 350 27 30 131 ...
##  $ minstay             : int  2 30 360 1 30 30 2 30 1 3 ...
##  $ latitude            : num  37.7 37.8 37.8 37.8 37.8 ...
##  $ longitude           : num  -122 -122 -122 -122 -122 ...
##  $ last_modified       : Factor w/ 189810 levels "00:00.4","00:02.4",..: 1959 1986 1914 1832 1838 1819 1886 1934 1805 1999 ...
##  $ date_collected      : Factor w/ 28 levels "2013-11-17","2014-05-11",..: 1 1 1 1 1 1 1 1 1 1 ...
#Setting correct format for dates
airbnb$date_collected <- as.Date(airbnb$date_collected)

I now want to get some summary statistics for the different fields in the dataframe.

##     room_id            host_id                    room_type     
##  Min.   :     958   Min.   :       46                  :    43  
##  1st Qu.: 2433743   1st Qu.:  2597104   Entire home/apt:121681  
##  Median : 6988943   Median :  8539143   Private room   : 76404  
##  Mean   : 7020337   Mean   : 18201785   Shared room    :  7404  
##  3rd Qu.:10772374   3rd Qu.: 25805964                           
##  Max.   :19781990   Max.   :139553832                           
##                     NA's   :6                                   
##  borough                       neighborhood       reviews      
##  Mode:logical   Mission              : 25487   Min.   :  0.00  
##  NA's:205532    Western Addition     : 20223   1st Qu.:  1.00  
##                 South of Market      : 15860   Median :  5.00  
##                 Castro/Upper Market  : 12098   Mean   : 21.07  
##                 Downtown/Civic Center: 11404   3rd Qu.: 22.00  
##                 Haight Ashbury       : 10524   Max.   :513.00  
##                 (Other)              :109936   NA's   :49      
##  overall_satisfaction  accommodates       bedrooms          price      
##  Min.   :0.00         Min.   : 1.000   Min.   : 0.000   Min.   :    0  
##  1st Qu.:4.50         1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:  108  
##  Median :5.00         Median : 2.000   Median : 1.000   Median :  167  
##  Mean   :3.96         Mean   : 3.082   Mean   : 1.346   Mean   :  252  
##  3rd Qu.:5.00         3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:  256  
##  Max.   :5.00         Max.   :18.000   Max.   :10.000   Max.   :30000  
##  NA's   :46354        NA's   :8421     NA's   :10690                   
##     minstay           latitude       longitude     
##  Min.   :   1.00   Min.   :37.71   Min.   :-122.5  
##  1st Qu.:   1.00   1st Qu.:37.75   1st Qu.:-122.4  
##  Median :   2.00   Median :37.77   Median :-122.4  
##  Mean   :   3.53   Mean   :37.77   Mean   :-122.4  
##  3rd Qu.:   3.00   3rd Qu.:37.79   3rd Qu.:-122.4  
##  Max.   :1000.00   Max.   :37.83   Max.   :-122.4  
##  NA's   :71477                                     
##                     last_modified    date_collected      
##  2015-08-21 16:54:47.397989:  3974   Min.   :2013-11-17  
##  48:03.8                   :    24   1st Qu.:2016-02-17  
##  25:46.6                   :    18   Median :2016-07-17  
##  49:41.8                   :    18   Mean   :2016-07-02  
##  40:27.5                   :    17   3rd Qu.:2017-01-14  
##  43:16.5                   :    17   Max.   :2017-07-10  
##  (Other)                   :201464

I am already seeing some trends in the data. For example, based on the 1st quartile and max value for overall_satisfaction, I suspect that very few listings have ratings below 4 out of 5. Let’s plot this to take a closer look.

The plot confirms my suspicion: 81.9% of the ratings are 4 or higher, with more than 50% of the ratings being 5. However, it is interesting to see that 16.8% of the rooms have a rating of 0. Let’s take a closer look at some summary statistics for these records to see if there’s a data quality issue.
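
The code that produced these percentages is not shown in this report. As a minimal sketch of how such shares could be computed, the following uses a made-up toy data frame in place of airbnb (the name toy and its values are assumptions for illustration only):

```r
# Toy stand-in for the airbnb data frame (values are invented)
toy <- data.frame(overall_satisfaction = c(5, 5, 4.5, 0, 4, NA, 5, 3))

# Share of each rating among rated records; table() drops NA by default
rating_share <- prop.table(table(toy$overall_satisfaction))

# Share of rated records with a rating of 4 or higher
share_4_plus <- mean(toy$overall_satisfaction >= 4, na.rm = TRUE)

print(round(100 * rating_share, 1))
print(round(100 * share_4_plus, 1))
```

Because NA ratings are excluded from both the numerator and the denominator, the shares describe rated records only.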

##     room_id            host_id                    room_type    
##  Min.   :    6810   Min.   :      316                  :    0  
##  1st Qu.: 8307860   1st Qu.:  4746287   Entire home/apt:15508  
##  Median :10948458   Median : 16570082   Private room   :10570  
##  Mean   :11360472   Mean   : 28935038   Shared room    :  597  
##  3rd Qu.:15529178   3rd Qu.: 44780730                          
##  Max.   :19781990   Max.   :139553832                          
##                                                                
##                 neighborhood      reviews       overall_satisfaction
##  Mission              : 2854   Min.   :0.0000   Min.   :0           
##  Western Addition     : 2483   1st Qu.:0.0000   1st Qu.:0           
##  South of Market      : 2475   Median :0.0000   Median :0           
##  Downtown/Civic Center: 2089   Mean   :0.5638   Mean   :0           
##  Haight Ashbury       : 1322   3rd Qu.:1.0000   3rd Qu.:0           
##  Bernal Heights       : 1285   Max.   :6.0000   Max.   :0           
##  (Other)              :14167                                        
##   accommodates       bedrooms          price            minstay     
##  Min.   : 1.000   Min.   : 0.000   Min.   :   10.0   Min.   : NA    
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:  100.0   1st Qu.: NA    
##  Median : 2.000   Median : 1.000   Median :  180.0   Median : NA    
##  Mean   : 3.202   Mean   : 1.375   Mean   :  309.2   Mean   :NaN    
##  3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:  300.0   3rd Qu.: NA    
##  Max.   :16.000   Max.   :10.000   Max.   :30000.0   Max.   : NA    
##                                                      NA's   :26675  
##     latitude       longitude      last_modified   date_collected      
##  Min.   :37.71   Min.   :-122.5   45:06.5:   15   Min.   :2016-12-23  
##  1st Qu.:37.76   1st Qu.:-122.4   25:57.8:   14   1st Qu.:2017-01-14  
##  Median :37.77   Median :-122.4   25:28.6:   13   Median :2017-03-12  
##  Mean   :37.77   Mean   :-122.4   47:37.9:   13   Mean   :2017-03-17  
##  3rd Qu.:37.79   3rd Qu.:-122.4   25:46.7:   12   3rd Qu.:2017-04-08  
##  Max.   :37.83   Max.   :-122.4   40:32.5:   12   Max.   :2017-07-10  
##                                   (Other):26596

The first thing I notice is that there seems to be a large number of records with 0 reviews. According to the dataset description, overall_satisfaction is “The average rating (out of five) that the listing has received from those visitors who left a review.” We can therefore assume that records with 0 reviews should have the rating set to NA, not a value from 0 to 5. Let’s take a look at how many records have 0 reviews and an overall_satisfaction between 0 and 5.

## [1] 15945
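
The query behind this count is not shown. One way it might be written, sketched here on an invented toy data frame (toy and n_suspect are hypothetical names):

```r
# Toy stand-in for airbnb (values are invented)
toy <- data.frame(
  reviews = c(0, 0, 3, 0, 12, NA),
  overall_satisfaction = c(0, NA, 4.5, 3, 5, 0)
)

# Records with zero reviews that still carry a satisfaction score.
# na.rm = TRUE skips rows where reviews itself is NA.
n_suspect <- sum(toy$reviews == 0 & !is.na(toy$overall_satisfaction),
                 na.rm = TRUE)
print(n_suspect)
```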

About 8% of the records match these criteria. Let’s plot the overall_satisfaction for these records.

It looks like all records with 0 reviews and a value in overall_satisfaction have it set to 0. I’ll run a filtered query to make sure.

##   room_id host_id    room_type   neighborhood reviews overall_satisfaction
## 1 1097480 4955917 Private room Outer Richmond       0                    3
## 2 1097480 4955917 Private room Outer Richmond       0                    3
## 3 1097480 4955917 Private room Outer Richmond       0                    3
##   accommodates bedrooms price minstay latitude longitude
## 1            1        1   123       5 37.77496 -122.5009
## 2            1        1   169       5 37.77496 -122.5009
## 3            1        1   192       5 37.77496 -122.5009
##             last_modified date_collected
## 1 2014-05-11 23:15:02.110     2014-05-11
## 2 2014-08-24 23:50:05.640     2014-08-24
## 3 2015-02-19 11:15:41.282     2015-02-19

3 out of 205k records is practically 0. I will now plot the overall_satisfaction column again excluding the 0 reviews records. I will also update the airbnb dataframe and change the overall_satisfaction score from 0 to NA for these records.
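
The update itself is not shown in this report. A minimal sketch of the recoding, assuming a toy data frame with the same two column names (the values are invented):

```r
# Toy stand-in for airbnb (values are invented)
toy <- data.frame(
  reviews = c(0, 0, 2, 1, NA),
  overall_satisfaction = c(0, 3, 4.5, 0, 5)
)

# Flag records with exactly zero reviews; rows with NA reviews stay untouched
no_reviews <- !is.na(toy$reviews) & toy$reviews == 0

# A rating without any reviews is treated as missing
toy$overall_satisfaction[no_reviews] <- NA

print(toy)
```

Note that a score of 0 on a record that does have reviews (the fourth row here) is deliberately left as it is.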

When excluding the records with zero reviews, the percentage of records with an overall rating of 0 goes down to 7.5. That’s still a fair amount, but it’s much more believable than before. Let’s run a summary query on these records to see if we can spot any trends.

##     room_id            host_id                    room_type   
##  Min.   :    6810   Min.   :      316                  :   0  
##  1st Qu.: 7764330   1st Qu.:  3990783   Entire home/apt:6115  
##  Median :11377038   Median : 13488074   Private room   :4398  
##  Mean   :11053131   Mean   : 26404572   Shared room    : 220  
##  3rd Qu.:15222764   3rd Qu.: 39631194                         
##  Max.   :19592343   Max.   :137719050                         
##                                                               
##                 neighborhood     reviews      overall_satisfaction
##  Mission              :1273   Min.   :1.000   Min.   :0           
##  South of Market      : 917   1st Qu.:1.000   1st Qu.:0           
##  Western Addition     : 913   Median :1.000   Median :0           
##  Downtown/Civic Center: 861   Mean   :1.401   Mean   :0           
##  Bernal Heights       : 574   3rd Qu.:2.000   3rd Qu.:0           
##  Haight Ashbury       : 511   Max.   :6.000   Max.   :0           
##  (Other)              :5684                                       
##   accommodates       bedrooms          price            minstay     
##  Min.   : 1.000   Min.   : 0.000   Min.   :   10.0   Min.   : NA    
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:   99.0   1st Qu.: NA    
##  Median : 2.000   Median : 1.000   Median :  155.0   Median : NA    
##  Mean   : 3.074   Mean   : 1.322   Mean   :  228.1   Mean   :NaN    
##  3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:  250.0   3rd Qu.: NA    
##  Max.   :16.000   Max.   :10.000   Max.   :10000.0   Max.   : NA    
##                                                      NA's   :10733  
##     latitude       longitude      last_modified   date_collected      
##  Min.   :37.71   Min.   :-122.5   29:05.3:    7   Min.   :2016-12-23  
##  1st Qu.:37.76   1st Qu.:-122.4   46:58.0:    7   1st Qu.:2017-01-14  
##  Median :37.77   Median :-122.4   29:06.9:    6   Median :2017-03-12  
##  Mean   :37.77   Mean   :-122.4   40:33.6:    6   Mean   :2017-03-16  
##  3rd Qu.:37.79   3rd Qu.:-122.4   46:59.7:    6   3rd Qu.:2017-04-08  
##  Max.   :37.83   Max.   :-122.4   47:32.0:    6   Max.   :2017-07-10  
##                                   (Other):10695

What stands out to me is that a large portion of these records have only one review. Let’s compare the number of reviews for these records with the full dataset.

Almost all the listings with overall_satisfaction score 0 have either 1 or 2 reviews. Next let’s take a look at the price distribution.

It looks like there are some extreme outliers for the price variable. Let’s zoom in to get a better sense of the price distribution.

##   95%   96%   97%   98%   99%  100% 
##   650   750   900  1000  1500 30000

The 98th price percentile is at 1000, and I’ll use that as the price cutoff to zoom in on the data.
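
As a rough sketch of this step, the upper percentiles can be obtained with quantile() and then used as a filter. The price vector below is invented, so its cutoff differs from the 1000 found in the real data:

```r
# Invented prices with one extreme outlier
prices <- c(50, 80, 100, 120, 150, 200, 250, 400, 1000, 30000)

# Upper percentiles, 95% through 100%
upper <- quantile(prices, probs = seq(0.95, 1, by = 0.01), na.rm = TRUE)
print(upper)

# Use the 98th percentile as the cutoff and keep everything at or below it
cutoff <- quantile(prices, probs = 0.98, na.rm = TRUE)
zoomed <- prices[prices <= cutoff]
print(zoomed)
```

Filtering on a percentile rather than a fixed number keeps the cutoff tied to the data, at the cost of hiding the most expensive units from the plot.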

Setting the max price to 1000, which excludes the 2% most expensive units, gives a clearer picture of the price distribution. The plot is heavily right-skewed, with most units being priced below 250.

I suspect that the size of a unit heavily affects its price. Therefore I will make a new column for price per bedroom.

# Creating the column
airbnb$price_per_bedroom <- airbnb$price / airbnb$bedrooms

# Replacing infinite values (price divided by zero bedrooms) with NA
is.na(airbnb$price_per_bedroom) <- is.infinite(airbnb$price_per_bedroom)

airbnb %>% 
  group_by(room_type) %>% 
  summarise(n = n(), avg_price = mean(price, na.rm = TRUE), 
            avg_bedroom_price = mean(price_per_bedroom, na.rm = TRUE))
## # A tibble: 4 x 4
##         room_type      n avg_price avg_bedroom_price
##            <fctr>  <int>     <dbl>             <dbl>
## 1                     43 162.60465               NaN
## 2 Entire home/apt 121681 336.69836         205.42191
## 3    Private room  76404 132.44443         131.93803
## 4     Shared room   7404  95.03025          93.30906

Price_per_bedroom percentiles:

##   95%   96%   97%   98%   99%  100% 
##   375   400   491   550   800 28000

Unsurprisingly, the trend is very similar for the price and price_per_bedroom plots: the distribution is heavily right-skewed.

Almost two thirds of all units are one-bedroom rentals. Room type is probably a factor here. I will revisit this in the bivariate plots section.

Almost all units have room_type either Entire home/apt or Private room. Later on it will be interesting to see if this changes over time.

Minstay percentiles:

##  95%  96%  97%  98%  99% 100% 
##   10   14   28   30   30 1000

It looks like most Airbnb hosts require a minimum stay between 1 and 3 nights, with few units requiring more than 7 nights. It is interesting to see that many more units have a minimum stay of 7 or 30 nights than of 6 nights, which is natural considering there are 7 days in a week and (roughly) 30 days in a month. There is also a slight increase at 10 days (round number) and 14 days (2 weeks).

The neighborhood distribution is very uneven, with some very large and some very small neighborhoods. Although it is outside the scope of this project, it would be interesting to compare the neighborhood proportions in the Airbnb dataset to population and housing data from San Francisco. This could reveal whether being an Airbnb host is much more common in some neighborhoods than in others.

While it is most common for units to accommodate 2 people, accommodating 4 people is also rather common. Few listings accommodate more than 6 people.

Univariate Analysis

What is the structure of your dataset?

There are 205,532 Airbnb listing records in the dataset. The listings were collected on 28 different dates between November 2013 and July 2017. Each record consists of 6 numeric, 2 nominal, and 2 ID variables (room_id and host_id). There are also 2 date columns and 2 columns with geographical location (longitude and latitude).

What is/are the main feature(s) of interest in your dataset?

From my initial analysis I consider price and date to be the main features of interest in the data set.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Price variations across neighborhoods will be interesting to take a look at. I believe number of bedrooms and room type are other interesting parameters which will greatly affect the price.

Did you create any new variables from existing variables in the dataset?

Yes, I created price per bedroom, as that seemed to be the best indicator of size. Without taking unit size into account it is very hard to get a true price picture.

Of the features you investigated, were there any unusual distributions?

It is a little surprising to me how large the portion of 1 bedroom apartments is. More than half of the units on the market fit into this category. There’s a chance that some of these records report the number of rooms available for rent, not the actual number of rooms in the unit. It will be interesting to later take a look at the bedroom distribution across room type.

Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

Yes. I initially downloaded 28 separate files which I merged to one file using the tidy function union. For full details about this process, see https://github.com/gisledb/udacity_nanodegree/blob/master/p6_data_visualization_with_Tableau/Project/union_script.R.

In the file you are reading now I have changed the data type of the date_collected column from factor to date. I also removed one “dead” variable, borough, as all the values were set to NA. During my analysis I found a data error which I corrected: 15,945 records with 0 reviews did not have overall_satisfaction set to NA, and >99.9% of these records had overall_satisfaction set to 0. I changed the overall_satisfaction of these records to NA.
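
The removal of the borough column is not shown in the report. One possible way to drop columns that contain nothing but NA, sketched on an invented toy data frame (toy and all_na are hypothetical names):

```r
# Toy stand-in for airbnb with one "dead" column (values are invented)
toy <- data.frame(room_id = 1:3, borough = NA, price = c(49, 172, 1097))

# TRUE for every column in which all values are NA
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))

# Keep only the columns that carry at least one value
toy <- toy[, !all_na, drop = FALSE]

print(names(toy))
```

For a single known column, airbnb$borough <- NULL would do the same job more directly.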

Bivariate Plots Section

In this section, among other things I want to take a closer look at the relationship between price and neighborhood, and price and time.

Before I go any further, I want to make sure the neighborhood data is correct. Since the dataset contains latitude and longitude data, this is fairly easy to check using a map plot.

This is the result I was hoping for. The data points are nicely clustered within their respective neighborhoods. Based on local knowledge I can confirm that the names of the neighborhoods are correct.

## # A tibble: 37 x 6
##          neighborhood     n     mean   med   min   max
##                <fctr> <int>    <dbl> <dbl> <dbl> <dbl>
##  1   Presidio Heights  1039 494.3898   200    49 10000
##  2           Presidio   155 439.9355   195    60  1500
##  3       Russian Hill  5925 355.0160   212    31  9995
##  4    Pacific Heights  5645 351.4376   220    10  9900
##  5             Marina  7670 322.4735   225    20  8964
##  6 Financial District  3313 312.1150   180    10 28000
##  7    South of Market 15860 303.3232   170     0 30000
##  8          Chinatown  2781 291.6559   179    19 10000
##  9       Potrero Hill  6720 287.4071   190    36  9000
## 10        North Beach  4169 277.6697   195    10  9999
## # ... with 27 more rows
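
The code for this table is not shown; the tibble output suggests it was produced with dplyr. As an assumption-laden sketch, the same kind of per-neighborhood summary can be written in base R, here on invented data with hypothetical neighborhood names A, B and C:

```r
# Toy stand-in for airbnb (values are invented)
toy <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B", "C"),
  price = c(100, 200, 900, 150, 250, 80)
)

# One vector of prices per neighborhood
groups <- split(toy$price, toy$neighborhood)

# Summary statistics per neighborhood, mirroring the columns in the table above
stats <- data.frame(
  neighborhood = names(groups),
  n    = vapply(groups, length, integer(1)),
  mean = vapply(groups, mean, numeric(1)),
  med  = vapply(groups, median, numeric(1)),
  min  = vapply(groups, min, numeric(1)),
  max  = vapply(groups, max, numeric(1)),
  row.names = NULL
)

# Most expensive neighborhoods (by mean price) first
stats <- stats[order(-stats$mean), ]
print(stats)
```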

Presidio Heights and Presidio are clearly the most expensive neighborhoods, with average prices per night of about $440 and above. The most affordable neighborhoods are Treasure Island, Crocker Amazon and Lakeshore. This makes sense, as they are all on the outskirts of the San Francisco city limits. Based on my local knowledge of the city there are no surprises in this figure.

When we compare the price to price per bedroom, Presidio Heights is no longer the most expensive neighborhood. Now it seems like Downtown and the Financial District have the most expensive units. One thing to note is that this plot excludes any units with zero or missing bedroom counts (roughly 13% of the records). Let’s see if looking at median instead of mean changes things.

Looking at median instead of mean, the situation changes quite a bit. For absolute price, Presidio Heights and Presidio drop to 4th and 5th places, while Marina, Pacific Heights and Russian Hill now occupy the top 3. Presidio has the highest price per bedroom, with Marina and Chinatown having the 2nd and 3rd highest bedroom price.

Comparing the mean and median plots, it seems like some neighborhoods have some very expensive units, which increase the mean values. Presidio and Presidio Heights in particular seem to be affected by this. The same does not seem to be the case the other way around, as all neighborhoods seem to have a higher mean than median price. Let’s create another plot to confirm this.

As expected, no neighborhood has a higher median price than mean price.
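
Besides a plot, this can also be checked numerically. A minimal sketch on invented data (the neighborhood names A, B and C and all prices are assumptions):

```r
# Toy stand-in for airbnb (values are invented)
toy <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B", "C"),
  price = c(100, 200, 900, 150, 250, 80)
)

# Mean and median price per neighborhood
means   <- tapply(toy$price, toy$neighborhood, mean)
medians <- tapply(toy$price, toy$neighborhood, median)

# TRUE if at least one neighborhood has a median above its mean
any_median_above_mean <- any(medians > means)

# How many times larger the mean is than the median, per neighborhood
mean_to_median <- means / medians

print(any_median_above_mean)
print(round(mean_to_median, 2))
```

A ratio well above 1 signals a right-skewed price distribution in that neighborhood.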

In this plot, which compares median and mean price per neighborhood, we see that the mean price in Presidio Heights is almost 2.5 times greater than the median price. From the univariate plot section I already know that the dataset has some large price outliers. A box plot might reveal whether the outliers are distributed across most or only a few neighborhoods.

The outliers are so dominating that it is hard to recognize this as a box plot. Let’s exclude the top percentile and run the box plots again.

Even after excluding the most extreme (top 1%) prices most neighborhoods seem to have a lot of large outliers. Many neighborhoods have a relatively large spread of prices, and the prices for most neighborhoods are right-skewed.

Except for one outlier date in 2015, the mean price has stayed fairly consistent, roughly between 210 and 270, over the whole time period in the dataset (late 2013 to mid-2017). This makes me curious about whether there was a special event in San Francisco at the time the data was collected which increased the prices dramatically, or if this is due to some outliers. Let’s see if the median price follows the same trend.

At first glance the median price appears to be much more volatile than the mean price, but this is mainly due to the y-axis range being much narrower in this last plot. Let’s plot the statistics together to get a clearer picture.

Here we see that median price in general follows the trend of the mean price. One interesting exception is the outlier date from the mean chart, which is not an outlier for median price. To me this indicates that the 2015 outlier date has some very high outlier prices. Let’s investigate this further.
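
The reasoning here rests on the mean being sensitive to a few extreme values while the median is not. A small worked example with invented prices for two hypothetical collection dates:

```r
# Two invented sets of prices that differ only in their largest value
normal_day  <- c(100, 150, 175, 200, 300)
outlier_day <- c(100, 150, 175, 200, 28000)

# The single extreme price pulls the mean up sharply
print(mean(normal_day))    # 185
print(mean(outlier_day))   # 5725

# The median ignores how far out the largest value lies
print(median(normal_day))  # 175
print(median(outlier_day)) # 175
```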

There is not an unusual number of records for any date in 2015, so record count is probably not a relevant factor. Next I’ll take a closer look at the data for the mean price outlier date.

## # A tibble: 6 x 6
##   date_collected     n     mean median   min   max
##           <date> <int>    <dbl>  <dbl> <dbl> <dbl>
## 1     2015-08-21  5140 481.0601    179    10 28000
## 2     2016-02-17  8549 268.1229    175     0 10000
## 3     2016-06-18  7783 265.9597    174    10 10000
## 4     2016-09-17  8076 263.7397    175     1 10000
## 5     2016-10-19  8236 262.5647    175     1 10000
## 6     2016-04-15  8051 260.7369    170    10 10000

2015-08-21 is the outlier date.

Running the summary() function on all the records:

##     room_id            host_id                    room_type     
##  Min.   :     958   Min.   :       46                  :    43  
##  1st Qu.: 2433743   1st Qu.:  2597104   Entire home/apt:121681  
##  Median : 6988943   Median :  8539143   Private room   : 76404  
##  Mean   : 7020337   Mean   : 18201785   Shared room    :  7404  
##  3rd Qu.:10772374   3rd Qu.: 25805964                           
##  Max.   :19781990   Max.   :139553832                           
##                     NA's   :6                                   
##                 neighborhood       reviews       overall_satisfaction
##  Mission              : 25487   Min.   :  0.00   Min.   :0.0         
##  Western Addition     : 20223   1st Qu.:  1.00   1st Qu.:4.5         
##  South of Market      : 15860   Median :  5.00   Median :5.0         
##  Castro/Upper Market  : 12098   Mean   : 21.07   Mean   :4.4         
##  Downtown/Civic Center: 11404   3rd Qu.: 22.00   3rd Qu.:5.0         
##  Haight Ashbury       : 10524   Max.   :513.00   Max.   :5.0         
##  (Other)              :109936   NA's   :49       NA's   :62326       
##   accommodates       bedrooms          price          minstay       
##  Min.   : 1.000   Min.   : 0.000   Min.   :    0   Min.   :   1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:  108   1st Qu.:   1.00  
##  Median : 2.000   Median : 1.000   Median :  167   Median :   2.00  
##  Mean   : 3.082   Mean   : 1.346   Mean   :  252   Mean   :   3.53  
##  3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:  256   3rd Qu.:   3.00  
##  Max.   :18.000   Max.   :10.000   Max.   :30000   Max.   :1000.00  
##  NA's   :8421     NA's   :10690                    NA's   :71477    
##     latitude       longitude                         last_modified   
##  Min.   :37.71   Min.   :-122.5   2015-08-21 16:54:47.397989:  3974  
##  1st Qu.:37.75   1st Qu.:-122.4   48:03.8                   :    24  
##  Median :37.77   Median :-122.4   25:46.6                   :    18  
##  Mean   :37.77   Mean   :-122.4   49:41.8                   :    18  
##  3rd Qu.:37.79   3rd Qu.:-122.4   40:27.5                   :    17  
##  Max.   :37.83   Max.   :-122.4   43:16.5                   :    17  
##                                   (Other)                   :201464  
##  date_collected       price_per_bedroom
##  Min.   :2013-11-17   Min.   :    0.0  
##  1st Qu.:2016-02-17   1st Qu.:   95.0  
##  Median :2016-07-17   Median :  130.0  
##  Mean   :2016-07-02   Mean   :  171.5  
##  3rd Qu.:2017-01-14   3rd Qu.:  189.0  
##  Max.   :2017-07-10   Max.   :28000.0  
##                       NA's   :26169

Running the summary() function on the 2015-08-21 records:

##     room_id           host_id                   room_type   
##  Min.   :   5193   Min.   :      46                  :   0  
##  1st Qu.:1334904   1st Qu.: 1622704   Entire home/apt:3027  
##  Median :3578486   Median : 5488994   Private room   :1879  
##  Mean   :3692542   Mean   : 9725157   Shared room    : 234  
##  3rd Qu.:6062615   3rd Qu.:14061126                         
##  Max.   :7983070   Max.   :42076683                         
##                                                             
##               neighborhood     reviews       overall_satisfaction
##  Mission            : 696   Min.   :  0.00   Min.   :1.00        
##  Western Addition   : 544   1st Qu.:  1.00   1st Qu.:4.50        
##  South of Market    : 369   Median :  7.00   Median :5.00        
##  Castro/Upper Market: 340   Mean   : 21.22   Mean   :4.74        
##  Haight Ashbury     : 286   3rd Qu.: 26.00   3rd Qu.:5.00        
##  Bernal Heights     : 267   Max.   :371.00   Max.   :5.00        
##  (Other)            :2638   NA's   :49       NA's   :945         
##   accommodates       bedrooms         price            minstay       
##  Min.   : 1.000   Min.   : 0.00   Min.   :   10.0   Min.   :  1.000  
##  1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.:  124.8   1st Qu.:  1.000  
##  Median : 2.000   Median : 1.00   Median :  179.0   Median :  2.000  
##  Mean   : 2.583   Mean   : 1.33   Mean   :  481.1   Mean   :  4.938  
##  3rd Qu.: 4.000   3rd Qu.: 2.00   3rd Qu.:  285.0   3rd Qu.:  3.000  
##  Max.   :16.000   Max.   :10.00   Max.   :28000.0   Max.   :365.000  
##  NA's   :650      NA's   :10                        NA's   :318      
##     latitude       longitude                         last_modified 
##  Min.   :37.71   Min.   :-122.5   2015-08-21 16:54:47.397989:3974  
##  1st Qu.:37.75   1st Qu.:-122.4   2015-08-21 16:56:54.438494:   1  
##  Median :37.77   Median :-122.4   2015-08-21 16:56:54.449790:   1  
##  Mean   :37.77   Mean   :-122.4   2015-08-21 16:56:54.455704:   1  
##  3rd Qu.:37.78   3rd Qu.:-122.4   2015-08-21 16:56:54.458357:   1  
##  Max.   :37.81   Max.   :-122.4   2015-08-21 16:56:54.460898:   1  
##                                   (Other)                   :1161  
##  date_collected       price_per_bedroom
##  Min.   :2015-08-21   Min.   :   10.0  
##  1st Qu.:2015-08-21   1st Qu.:  100.0  
##  Median :2015-08-21   Median :  140.0  
##  Mean   :2015-08-21   Mean   :  364.3  
##  3rd Qu.:2015-08-21   3rd Qu.:  199.0  
##  Max.   :2015-08-21   Max.   :28000.0  
##                       NA's   :358

Here I’m mostly interested in comparing quantiles for price and price_per_bedroom. These variables are not drastically different between the 2015-08-21 records and all records.

The boxplots indicate that there are more outliers above price 1000 for 2015-08-21 compared to the other dates.

Looking at the three bar charts above, we see that the August 2015 date does not have exceptionally many records with prices in the top 5% of the whole dataset. However, the August 2015 records do account for a very large number of the top 2% of prices in the dataset, and almost three times as many of the top 1% of prices as any other date in the dataset. Since this is showing absolute counts, and some of the later dates have more records than the August 2015 date, this becomes even more significant.

Judging from the above plot, the August_2015 price followed the general price trend up to about the 90th percentile. Let’s see if we can get some more details by applying log transformation to the plot.

We get a little bit more details from the log transformation, but the trend stays the same:the August 2015 price follows the general trend until about the 90th percentile, when the August 2015 price starts to become much higher than for the same percentiles in the whole dataset. I will plot this one last time, zooming in on the top 8 percentiles.

Compared to the August_2015, the 92nd to 99th percentiles for all_records are very flat. From the 95th percentile, the August_2015 values are many times larger than the values for all_records, only being surpassed at the 100th percentile.
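
The percentile comparison behind these plots is not shown. As a sketch under stated assumptions, two price vectors can be compared percentile by percentile like this (both vectors are invented; only the labels all_records and August_2015 come from the plots):

```r
# Invented price vectors standing in for the two groups
all_prices <- c(seq(50, 500, by = 10), 1000, 30000)
aug_prices <- c(seq(50, 500, by = 25), 5000, 9000, 28000)

# Percentiles 92 through 100
probs <- seq(0.92, 1, by = 0.01)

# One row per percentile, one column per group
comparison <- data.frame(
  percentile  = round(100 * probs),
  all_records = as.numeric(quantile(all_prices, probs)),
  August_2015 = as.numeric(quantile(aug_prices, probs))
)
print(comparison)
```

Putting both groups in one data frame makes it easy to plot them as two lines over the same percentile axis.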

With the most extreme outliers removed, the August 2015 mean price is much closer to the mean of the other dates.

All of the neighborhoods have a median overall_satisfaction score of either 4.5 or 5. The mean score varies a little more, although all but two neighborhoods score 4 or better.

Based on the above plot it looks like price has a large impact on the overall_satisfaction score. There’s also a rather large gap between the median and mean prices, especially for overall_satisfaction score 1.

Keep in mind that only 1.5% of the records have an overall_satisfaction score of between 1 and 3.5, and 7.5% of the records have a score of 0. It is therefore most interesting to look at the price differences in the 4 to 5 overall_satisfaction range, where there’s a fairly clear trend: pricier units receive better scores.

The plot showing mean overall_satisfaction over time has a strange dip from the end of 2016 until the end of the date range. I suspect this has something to do with the remaining 0 values in the dataset. I’ll take a closer look at those next.

Not a single record before 2016-12-23 has an overall_satisfaction score of 0. Since there is a small chance Airbnb actually allowed 0 overall_satisfaction scores starting late 2016, I won’t do anything further with these values. I will instead avoid using this variable in the rest of my analysis.
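
A check for when the 0 scores first appear could look like the sketch below. The `scores` data frame is a made-up stand-in; only the column names `date_collected` and `overall_satisfaction` are taken from the dataset.

```r
# Made-up sample with the two relevant columns.
scores <- data.frame(
  date_collected = as.Date(c("2016-11-01", "2016-11-01", "2016-12-23",
                             "2016-12-23", "2017-01-14")),
  overall_satisfaction = c(4.5, NA, 0, 5, 0)
)

# Keep only rows with an explicit score of 0 (NA is not the same as 0).
is_zero <- !is.na(scores$overall_satisfaction) & scores$overall_satisfaction == 0
zero_rows <- scores[is_zero, ]

# Earliest collection date on which a 0 score occurs.
first_zero_date <- min(zero_rows$date_collected)

# Share of non-missing scores equal to 0, per collection date.
zero_share <- tapply(scores$overall_satisfaction == 0,
                     scores$date_collected, mean, na.rm = TRUE)

print(first_zero_date)
print(zero_share)
```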

Unsurprisingly, units with more bedrooms cost more. The same trend holds for the accommodates variable: the more people a unit can accommodate, the pricier the unit tends to be.

Due to the number of dates, this is not very easy to read. Let’s make a line chart instead.

With the line chart we see a clearer picture. The proportion of Entire home/apt units has stayed fairly consistent throughout the time period, while the proportion of private room units has increased and that of shared room units has decreased. I also notice that a few dates at the end of 2015 and in the beginning of 2016 lack room_type information for some records.

## # A tibble: 4 x 2
##   date_collected     n
##           <date> <int>
## 1     2015-10-21    36
## 2     2015-11-21     1
## 3     2015-12-14     3
## 4     2016-01-16     3
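
The per-date proportions and the missing room_type counts can be reproduced in outline as follows. This is a sketch on made-up rows, assuming only the column names `date_collected` and `room_type`; it is not the code that produced the chart or the tibble above.

```r
# Made-up rows; one record has no room_type.
rooms <- data.frame(
  date_collected = as.Date(c(rep("2015-09-01", 4), rep("2015-10-21", 4))),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room", "Shared room",
                "Entire home/apt", "Private room", "Private room", NA)
)

# Counts per date and room type, keeping missing values as their own column.
counts <- table(rooms$date_collected, rooms$room_type, useNA = "ifany")

# Row-wise proportions: each date sums to 1.
proportions <- prop.table(counts, margin = 1)

# Number of records without room_type, per date.
missing_per_date <- tapply(is.na(rooms$room_type), rooms$date_collected, sum)

print(round(proportions, 2))
print(missing_per_date)
```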

It is reassuring to see that both of the two room types Private room and Shared room have a great majority of 1 bedroom units.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

There is a fairly clear relationship between price and neighborhood. This makes sense, as some neighborhoods are more desirable than others. Based on local knowledge about San Francisco, it seems like the more central neighborhoods have more expensive units.

Except for some outlier price points in August 2015, the mean and median prices did not vary much over the four-year period. The prices at the last date of data collection were actually lower than at most other dates.

When it comes to overall_satisfaction, there’s a fairly clear trend for the higher scores: more expensive units receive a higher score. Be aware that the overall_satisfaction data is rather fragile, and I had to exclude about 30% of the records from overall_satisfaction analysis due to null values.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There appears to be a relationship between neighborhood and overall satisfaction. However, the relationship is not very strong, as the mean scores of all but 3 neighborhoods fall within 0.6 points of each other.

Less interesting but worth mentioning is that there is a clear correlation between price and both the number of bedrooms and the number of people a unit can accommodate. Considering that both of these variables function as proxies for unit size, this indicates the following finding: larger units cost more.

What was the strongest relationship you found?

There are clear correlations between price and the number of bedrooms, and between price and the number of people a unit can accommodate. Both accommodation count and number of bedrooms function as proxies for unit size, and indicate the following finding: larger units cost more.
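
One way to put a number on these relationships is a rank correlation. The sketch below uses Spearman's method, which is an assumption on my part since the analysis above relies on plots rather than a specific coefficient, and the `units` data frame contains made-up values.

```r
# Made-up units; column names follow the dataset.
units <- data.frame(
  bedrooms     = c(0, 1, 1, 2, 2, 3, 4),
  accommodates = c(1, 2, 2, 4, 3, 6, 8),
  price        = c(80, 120, 150, 220, 200, 400, 650)
)

# Spearman's rank correlation is less sensitive to extreme prices
# than Pearson's, which suits right-skewed price data.
cor_bedrooms     <- cor(units$price, units$bedrooms, method = "spearman")
cor_accommodates <- cor(units$price, units$accommodates, method = "spearman")

print(c(bedrooms = cor_bedrooms, accommodates = cor_accommodates))
```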

Multivariate Plots Section

Entire home/apt units tend to have a higher price than private rooms, while shared rooms have the lowest prices of the three room_type categories. This makes sense, as entire homes tend to be larger than private rooms. While less clear, entire homes also tend to have a higher per bedroom price.

The prices of entire home units have increased somewhat since the early days. Private rooms have become a little cheaper, while prices for shared rooms, excluding one date in August 2015, have been fairly stable.

In the plots above I have excluded the units with 5 or more bedrooms (top 1%). With the exception of 4-bedroom units, the median price for all bedroom counts has stayed stable over time. The mean price is affected by the large price outliers in August 2015. Excluding that date, the mean price for 0-2 bedroom units has stayed stable. The 3-bedroom units had a mean price increase from 2013 until early 2016, and have since seen a slow mean price decrease of about $100. The mean price for 4-bedroom units saw a sharp increase until early 2016, when it stabilized around $800.

For room types Entire home/apt and private room, the median price trend of neighborhoods seems to match the overall trend, with a few exceptions. The price of shared rooms varies much more across neighborhoods, which I suspect is partially due to a small number of data points in some neighborhoods.

Looking at this, Presidio stands out. While price per bedroom is fairly stable throughout the time period, the unit price varies drastically. Let’s take a closer look at Presidio to figure out what is going on.

Looking at the summary statistics, there seems to be a clear problem with drawing any conclusions from the Presidio data: the sample size per date is just too small, with most dates having fewer than 10 records. Let’s take a look at the sample sizes for all neighborhoods.

## # A tibble: 37 x 2
##           neighborhood     n
##                 <fctr> <int>
##  1            Presidio   155
##  2 Treasure Island/YBI   262
##  3    Golden Gate Park   354
##  4            Seacliff   429
##  5     Diamond Heights   446
##  6      Crocker Amazon   616
##  7   Visitacion Valley   735
##  8    Presidio Heights  1039
##  9           Lakeshore  1278
## 10           Glen Park  1889
## # ... with 27 more rows

Presidio is clearly the most troublesome neighborhood in this regard, having about 40% fewer records than the second lowest neighborhood (155 versus 262). However, some of the other neighborhoods might also have too few records to support reliable conclusions. Let’s take a closer look at the neighborhoods with fewer than 1000 total records.

To some extent all of these are troubling. With the exception of Crocker Amazon, all these neighborhoods have some dates with fewer than 10 records. However, only the Presidio has consistently around 10 or fewer records per date over time.
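
Flagging neighborhoods that have any collection date with fewer than 10 records can be sketched as below. The record counts are made up for illustration, and the cutoff of 10 is the one used in the text.

```r
# Made-up records: Presidio has 3 and 2 records on the two dates,
# Mission has 12 and 15.
obs <- data.frame(
  neighborhood = c(rep("Presidio", 5), rep("Mission", 27)),
  date_collected = as.Date(c(rep("2016-01-16", 3), rep("2016-02-16", 2),
                             rep("2016-01-16", 12), rep("2016-02-16", 15)))
)

# Record count for every neighborhood/date combination.
per_date <- table(obs$neighborhood, obs$date_collected)

# Smallest per-date sample size for each neighborhood.
min_per_date <- apply(per_date, 1, min)

# Neighborhoods with at least one date below the cutoff.
low_record <- names(min_per_date)[min_per_date < 10]

print(per_date)
print(low_record)
```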

Excluding the low-record neighborhoods shows a clearer relationship between median price and median price per bedroom for all neighborhoods. I also notice that except for the first date, which we know has many fewer records compared to the other dates, the median price per bedroom within neighborhoods stayed fairly stable throughout the time period. The same is true for median price, except for Presidio Heights, which had some peaks on certain dates.

It looks to me like the most affordable units are widely distributed throughout all the neighborhoods, while the most expensive units can mostly be found in the most central parts of the city.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Neighborhood and price have a clear correlation, even after including time as a variable. Both median and mean prices vary a fair amount from neighborhood to neighborhood, and prices within neighborhoods stayed fairly stable throughout the date range.

Were there any interesting or surprising interactions between features?

I found it interesting how a relatively large dataset can still be too small for detailed analysis when you include multiple variables. The clearest example of this was how I had to exclude certain neighborhoods from my analysis due to having too few data points (fewer than 10) on certain dates.


Final Plots and Summary

Plot One

Description One

In this plot the max price is set to 1000, which excludes the 2% most expensive units. Even when the large outliers are removed, the price data is still strongly right-skewed.

Plot Two

Description Two

All the neighborhoods have a higher mean than median. Some neighborhoods have a drastically higher mean than median, indicating right-skewed data or an issue with outliers (a large number of outliers, some very large outliers, or both).
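
The gap between mean and median can be summarized per neighborhood with a simple ratio. In this sketch the neighborhood labels "A" and "B" and all prices are hypothetical; a ratio well above 1 points to right skew or large outliers.

```r
# Hypothetical prices: neighborhood B contains one very large outlier.
prices <- data.frame(
  neighborhood = c(rep("A", 4), rep("B", 4)),
  price = c(100, 110, 120, 130,
            100, 110, 120, 2000)
)

mean_price   <- tapply(prices$price, prices$neighborhood, mean)
median_price <- tapply(prices$price, prices$neighborhood, median)

# Ratio of mean to median; 1 means no pull from the upper tail.
mean_to_median <- mean_price / median_price

print(round(mean_to_median, 2))
```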

Plot Three

Description Three

Units in the first price quartile are spread much more widely across the city than units in the other quartiles. For the fourth price quartile the units are densely located in the more central parts of the city.
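
Assigning each unit to a price quartile can be done with `quantile()` and `cut()`. This is a minimal sketch on made-up prices; the labels Q1 to Q4 are names chosen here for illustration.

```r
# Made-up prices.
price <- c(50, 80, 100, 120, 150, 180, 250, 400)

# Quartile boundaries: minimum, 25th, 50th and 75th percentiles, maximum.
breaks <- quantile(price, probs = seq(0, 1, by = 0.25), names = FALSE)

# include.lowest = TRUE keeps the cheapest unit inside the first bin.
price_quartile <- cut(price, breaks = breaks, include.lowest = TRUE,
                      labels = c("Q1", "Q2", "Q3", "Q4"))

print(table(price_quartile))
```

The resulting factor can then be used to facet or color a map of latitude and longitude by quartile.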


Reflection

Exploring the Airbnb data from San Francisco has been an interesting endeavor. Although I do not think I have discovered any dramatic new findings, there are plenty of interesting observations to be made. As an example, I find it interesting how price has stayed fairly stagnant throughout the four years, while the number of rental units on the market has increased dramatically.

Since I chose to find a dataset on my own I had to be extra wary of potential data quality issues. Several times throughout my exploratory analysis I came across suspicious values, dips and peaks. This led to lengthy examinations, but in the end I felt more confident about what parts of the data were trustworthy, and which variables were too uncertain to continue focusing on.

The clearest relationships in the dataset are also the most obvious ones, namely the correlation between price and the proxies for rental unit size (bedroom count and how many people the unit can accommodate). There is also a clear correlation between price and neighborhood.

I have intentionally chosen not to look at changes for host_id and room_id over time, as I have considered it to be outside the scope of this project. It would be interesting to find out how ownership changes over time, and to see if any of the variables for individual housing units ever change.